feat(rust/sedona-spatial-join-gpu): Add GPU-accelerated spatial join support #465
base: main
Conversation
Will merge the patch to fix the failing example build once this is merged - #486

paleolimbot left a comment
Thank you for working on this!
In addition to specific comments, I'm concerned about the proliferation of conditional compilation (particularly within sedona-spatial-join, which is a rather important part of our engine to keep clean).
At a high level what sedona-spatial-join-gpu is doing is more like sedona-spatial-join-extension: it provides a simpler (FFI-friendly) mechanism to inject a join operator without dealing with the DataFusion-y details. I think most of the conditional compilation/dead code/unused/ignore directives could be avoided if we add a CPU join extension and use that for all the tests. The GPU extension itself would then be a runtime implementation detail (eventually loaded at runtime via FFI).
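To make the suggestion concrete, here is a minimal sketch of what such an extension hook could look like (all names here are illustrative, not an existing crate API): the engine asks a registered extension for a join plan and falls back to the built-in CPU join when none applies.

```rust
use std::sync::Arc;

use datafusion::error::Result;
use datafusion::physical_plan::ExecutionPlan;

/// Hypothetical extension point: a GPU (or CPU test) implementation can be
/// registered at runtime instead of being compiled into sedona-spatial-join.
pub trait SpatialJoinExtension: Send + Sync {
    /// Return Some(plan) if this extension wants to handle the join,
    /// or None to fall back to the default CPU implementation.
    fn try_create_join(
        &self,
        left: Arc<dyn ExecutionPlan>,
        right: Arc<dyn ExecutionPlan>,
    ) -> Result<Option<Arc<dyn ExecutionPlan>>>;
}
```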
// Helper execution plan that returns a single pre-loaded batch
struct SingleBatchExec {
    schema: Arc<Schema>,
    batch: RecordBatch,
    props: datafusion::physical_plan::PlanProperties,
}
This seems very similar to SessionContext::register_batch() and is a lot of lines of code. Do we need this?
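For reference, a minimal sketch of the suggested alternative, assuming the test only needs the batch to be queryable (the table name and query are illustrative):

```rust
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

async fn query_single_batch(batch: RecordBatch) -> Result<()> {
    let ctx = SessionContext::new();
    // Registers the pre-loaded batch as a table, replacing the custom exec plan.
    let _ = ctx.register_batch("probe_side", batch)?;
    let df = ctx.sql("SELECT * FROM probe_side").await?;
    df.show().await?;
    Ok(())
}
```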
This file has been removed
/// Generate random points within a bounding box
fn generate_random_points(count: usize) -> Vec<String> {
    use rand::Rng;
    let mut rng = rand::thread_rng();
    (0..count)
        .map(|_| {
            let x: f64 = rng.gen_range(-180.0..180.0);
            let y: f64 = rng.gen_range(-90.0..90.0);
            format!("POINT ({} {})", x, y)
        })
        .collect()
}
We have a random geometry generator in sedona-testing (that is used in the non-GPU join tests and elsewhere) that I think we should be using here!
@zhangfengcdt Hi Feng, do we still need this benchmark program? It compares to a brute-force CPU implementation.
        sedona_libgpuspatial::SpatialPredicate::Intersects,
    ),
    device_id: 0,
    batch_size: 8192,
Should this be Option<usize> so that it can default to the datafusion.batch_size setting?
The batch_size option does not exist anymore. To fully exploit the parallelism of the GPU, users need to manually set datafusion.execution.batch_size to a very large value. I want to override this value when the GPU feature is enabled, but I haven't figured out the right place to insert that logic.
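For context, a sketch of how a caller can raise the batch size today, either programmatically or via SQL; the 65536 value is only illustrative:

```rust
use datafusion::prelude::{SessionConfig, SessionContext};

fn gpu_friendly_context() -> SessionContext {
    // Larger batches give the GPU join more work per kernel launch.
    let config = SessionConfig::new().with_batch_size(65_536);
    SessionContext::new_with_config(config)
}

// Equivalent at the SQL level:
//   SET datafusion.execution.batch_size = 65536;
```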
let properties = PlanProperties::new(
    eq_props,
    partitioning,
    EmissionType::Final, // GPU join produces all results at once
Just checking that this is correct (I thought that because one side is streaming the output might be incremental?)
This file has been completely rewritten.
/// GPU backend for spatial operations
#[allow(dead_code)]
pub struct GpuBackend {
    device_id: i32,
    gpu_context: Option<GpuSpatialContext>,
}

#[allow(dead_code)]
impl GpuBackend {
Can these dead code markers be removed?
This file has been removed.
let kernels = scalar_kernels();
let sedona_type = SedonaType::Wkb(Edges::Planar, lnglat());

let _cpu_testers: std::collections::HashMap<&str, ScalarUdfTester> = [
Is there a reason this variable is not used / can we do this using a for loop to avoid this indirection?
This test was incomplete, but it has now been completed.
rust/sedona-spatial-join/src/exec.rs (Outdated)

#[cfg(feature = "gpu")]
#[tokio::test]
#[ignore] // Requires GPU hardware
We need to figure out a way to not ignore tests in this repo (in this case I think these tests shouldn't exist if the gpu feature isn't enabled, so we shouldn't need to ignore them?)
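A minimal sketch of that arrangement (module and test names are placeholders): gate the whole test module on the feature so no #[ignore] is needed.

```rust
// Only compiled when the crate is built with `--features gpu`.
#[cfg(feature = "gpu")]
mod gpu_join_tests {
    #[tokio::test]
    async fn gpu_spatial_join_smoke_test() {
        // ...exercises the GPU join; never built (or listed) without the feature
    }
}
```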
Fixed.
SpatialRelationType::Intersects => LibGpuPred::Intersects,
SpatialRelationType::Contains => LibGpuPred::Contains,
SpatialRelationType::Covers => LibGpuPred::Covers,
SpatialRelationType::Within => LibGpuPred::Within,
SpatialRelationType::CoveredBy => LibGpuPred::CoveredBy,
SpatialRelationType::Touches => LibGpuPred::Touches,
SpatialRelationType::Equals => LibGpuPred::Equals,
Can we move SpatialRelationType to sedona-geometry or sedona-common to avoid two copies?
I have extracted SpatialRelationType into a separate file under sedona-geometry. It involves some changes in the sedona-spatial-join package; I'd like to ask @Kontinuation whether this change is appropriate.
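A sketch of what the extracted type and its GPU mapping might look like (the module path and function name are assumptions; the variants are the ones shown in this diff):

```rust
use sedona_geometry::types::SpatialRelationType; // assumed path after extraction
use sedona_libgpuspatial::SpatialPredicate as LibGpuPred;

/// Single home for the mapping, returning None for any predicate the GPU
/// backend does not support (assumption).
fn to_gpu_predicate(rel: SpatialRelationType) -> Option<LibGpuPred> {
    Some(match rel {
        SpatialRelationType::Intersects => LibGpuPred::Intersects,
        SpatialRelationType::Contains => LibGpuPred::Contains,
        SpatialRelationType::Covers => LibGpuPred::Covers,
        SpatialRelationType::Within => LibGpuPred::Within,
        SpatialRelationType::CoveredBy => LibGpuPred::CoveredBy,
        SpatialRelationType::Touches => LibGpuPred::Touches,
        SpatialRelationType::Equals => LibGpuPred::Equals,
        _ => return None,
    })
}
```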
We have a unified spatial-join package. This piece of code has been removed
git submodule update --recursive should remove this diff
Fixed
python/sedonadb/Cargo.toml (Outdated)

default = ["mimalloc"]
mimalloc = ["dep:mimalloc", "dep:libmimalloc-sys"]
s2geography = ["sedona/s2geography"]
gpu = ["sedona/gpu"]
Because we don't have any tests in Python for this feature I suggest leaving this out for now (a follow-up PR could add Python support + a test)
I have removed this feature.
af9f3b6 to 470eb03

Hi @zhangfengcdt @paleolimbot, the GPU-based spatial join has been completely rewritten. The initial version wrapped both the filtering and refinement stages into the same interface (C library). However, that approach suffered from build-side parsing overhead because it could not be shared across partitions. I have redesigned it to match the CPU-based join, where the two stages are separated. This version runs faster and takes less memory. It also supports WHERE clauses (which the initial version did not). The implementation is based on @Kontinuation's spatial-join, so please give me some advice if you can. I realize this PR is large, but the structural changes were necessary. Please review it when you have time. Thanks!
The modifications made to the existing CPU spatial join code are pretty trivial. If I understand it correctly, it has only moved [...]. I still need to take some time looking into sedona-spatial-join-gpu. It is an enormous amount of code.
tokio = { workspace = true }
mimalloc = { workspace = true, optional = true }
libmimalloc-sys = { workspace = true, optional = true }
env_logger = { workspace = true }
I'm in favor of it; I had it when testing my local out-of-core spatial join branch.
The GPU spatial join code is mostly the same as the CPU code. The difference is mainly in the spatial index.
Notably, the GPU index performs filtering and refinement in batches, while the current spatial index in the CPU spatial join works with one probe geometry at a time. I have already refactored the CPU spatial index to work with batches for better performance: https://github.com/Kontinuation/sedona-db/blob/485d45fb95fd278b47253bfc2cb1f3fd93798075/rust/sedona-spatial-join/src/index/spatial_index.rs#L437-L460. This will be submitted upstream soon, and I believe it is quite feasible to unify the code for CPU- and GPU-based spatial join.
I am fine with duplicating code for GPU spatial join, we can do some refactoring to unify them later on.
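To illustrate the difference (these signatures are illustrative only, not the actual traits): the per-geometry probe versus a batched probe that both the refactored CPU index and the GPU index could share.

```rust
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::error::Result;

/// Current CPU-style interface: one probe geometry (as WKB) at a time.
pub trait ProbeOneAtATime {
    fn probe(&self, probe_geom_wkb: &[u8]) -> Result<Vec<u32>>; // build-side matches
}

/// Batched interface: filter and refine a whole probe batch in one call,
/// returning matching (build_index, probe_index) pairs.
pub trait ProbeBatched {
    fn probe_batch(&self, probe: &RecordBatch) -> Result<(Vec<u32>, Vec<u32>)>;
}
```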
c/sedona-libgpuspatial/src/lib.rs (Outdated)

index.create_context(&mut ctx);

// Get results
let build_indices = joiner.get_build_indices_buffer(context).to_vec();
let stream_indices = joiner.get_stream_indices_buffer(context).to_vec();
// Push stream data (probe side) and perform join
unsafe {
    index.probe(&mut ctx, rects.as_ptr() as *const f32, rects.len() as u32)?;
}

Ok((build_indices, stream_indices))
// Get results
let build_indices = index.get_build_indices_buffer(&mut ctx).to_vec();
let probe_indices = index.get_probe_indices_buffer(&mut ctx).to_vec();
index.destroy_context(&mut ctx);
Ok((build_indices, probe_indices))
ctx will leak when index.probe fails.
Fixed.
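For reference, one possible shape of the fix, based on the calls shown in this diff: destroy the context before propagating a probe error instead of returning early with `?`. A small drop guard around the context would achieve the same thing with less repetition.

```rust
// Run the probe without `?` so the context can be cleaned up on failure.
let probe_result = unsafe {
    index.probe(&mut ctx, rects.as_ptr() as *const f32, rects.len() as u32)
};
if let Err(e) = probe_result {
    index.destroy_context(&mut ctx); // avoid leaking ctx when probe fails
    return Err(e);
}

let build_indices = index.get_build_indices_buffer(&mut ctx).to_vec();
let probe_indices = index.get_probe_indices_buffer(&mut ctx).to_vec();
index.destroy_context(&mut ctx);
Ok((build_indices, probe_indices))
```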
self.build_batch.batch = arrow::compute::concat_batches(&schema, all_record_batches)
    .map_err(|e| {
        DataFusionError::Execution(format!("Failed to concatenate left batches: {}", e))
    })?;
I wonder if it is necessary to concatenate all the batches and call gs.push_build only once. It would be easier for us to unify the CPU- and GPU-based spatial join if we support calling push_build multiple times.
There might be some performance implications switching to multiple gs.push_build calls. If it is not feasible, we need to be aware that concatenating batches doubles the memory requirement, so we need to reserve more memory for building GPU indexes.
It is feasible to call push_build many times, but it may slightly increase the build time. I think it is worth making this change to unify the code.
I have added an option that enables/disables the concatenation operations.
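A sketch of the two paths behind that option (the builder trait, option name, and push_build signature below are assumptions based on this discussion, not the PR's actual types):

```rust
use arrow::compute::concat_batches;
use arrow::datatypes::SchemaRef;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Minimal stand-in for the GPU index builder discussed above.
trait PushBuild {
    fn push_build(&mut self, batch: &RecordBatch);
}

fn load_build_side<B: PushBuild>(
    gs: &mut B,
    schema: &SchemaRef,
    batches: &[RecordBatch],
    concat_build_side: bool, // the new option (name assumed)
) -> Result<(), ArrowError> {
    if concat_build_side {
        // One push_build call on a single concatenated batch; concatenation
        // roughly doubles peak memory on the build side.
        gs.push_build(&concat_batches(schema, batches)?);
    } else {
        // Stream each batch into the index: slightly slower build, no extra copy.
        for batch in batches {
            gs.push_build(batch);
        }
    }
    Ok(())
}
```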
01d495b to 58919b0

Hi @Kontinuation, I have merged the GPU code path into sedona-spatial-join, so there's no duplicated code issue. Could you review the changes to the spatial-join package?

Kontinuation left a comment
The overall approach looks good to me.
pub(crate) trait SpatialIndexFull: SpatialIndex + SpatialIndexInternal {}

let index = builder.finish().unwrap();
impl<T> SpatialIndexFull for T where T: SpatialIndex + SpatialIndexInternal {}
I prefer defining the SpatialIndex trait as one rather than splitting it into external and internal interfaces. At least for now SpatialIndex is only intended to be used by spatial join so we don't need to define external interfaces for it.
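A sketch of the unified shape (method names are illustrative): one crate-private trait in place of SpatialIndex + SpatialIndexInternal plus the blanket SpatialIndexFull impl.

```rust
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::error::Result;

/// Single crate-private trait; no external/internal split is needed while
/// the index is only consumed by the spatial join itself.
pub(crate) trait SpatialIndex {
    fn push_build(&mut self, batch: &RecordBatch) -> Result<()>;
    fn probe_batch(&mut self, batch: &RecordBatch) -> Result<(Vec<u32>, Vec<u32>)>;
}
```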
I have fixed this. Now, it only has a unified interface.
rust/sedona-spatial-join/src/exec.rs (Outdated)

let use_gpu = cfg!(feature = "gpu");

let mark_exec = SpatialJoinExec::try_new(
    original_exec.left.clone(),
    original_exec.right.clone(),
    original_exec.on.clone(),
    original_exec.filter.clone(),
    &join_type,
    None,
    use_gpu,
Do we still run CPU spatial join tests when GPU feature is enabled?
I initially thought we would run tests twice with and without --features gpu. This can eliminate duplicated code. However, I think it is more natural to run both CPU and GPU tests when using --features gpu. I have updated the tests accordingly.
}
/// Build visited bitmaps for tracking left-side indices in outer joins.
We should add empty lines in between methods. I'm not sure if there's a cargo fmt option for automatically doing this.
Fixed
feat(rust/sedona-spatial-join-gpu): Add GPU-accelerated spatial join support

This commit introduces GPU-accelerated spatial join capabilities to SedonaDB, enabling significant performance improvements for large-scale spatial join operations.

Key changes:
- Add new `sedona-spatial-join-gpu` crate that provides GPU-accelerated spatial join execution using CUDA via the `sedona-libgpuspatial` library.
- Implement `GpuSpatialJoinExec` execution plan with build/probe phases that efficiently handles partitioned data by sharing build-side data across probes.
- Add GPU backend abstraction (`GpuBackend`) for geometry data transfer and spatial predicate evaluation on GPU.
- Extend the spatial join optimizer to automatically select GPU execution when available and beneficial, with configurable thresholds and fallback to CPU.
- Add configuration options in `SedonaOptions` for GPU spatial join settings including enable/disable, row thresholds, and CPU fallback behavior.
- Include comprehensive benchmarks and functional tests for GPU spatial join correctness validation against CPU reference implementations.
b93eb39 to adbfbb5
Here's the evaluation result from a variant of Sedona-SpatialBench. Each query was executed 5 times with one round of warmup to cache datasets in buff/cache, and the average times are reported. Using the GPU join is faster for queries with heavy joins, such as Q10 and Q11.

Commit: b690b12

Commit: e106ce6

@pwrliang Great work!
paleolimbot left a comment
Thank you for working on this! This is a substantial improvement on the initial work (which was also a substantial improvement!). In particular I am grateful for your work to ensure the integration with the existing sedona-spatial-join code.
I think this would work best as at least two PRs:
- Update the C++ library portion with the new approach (I left review comments on the C++ portion), since this is nicely contained
- Add a Rust wrapper around gpuspatial_c.h with a few "is it plugged in" tests
- Integrate with sedona-spatial-join
In addition to being easier for both the community and LLMs to review, @Kontinuation has rather heroically split a very large change implementing larger-than-memory spatial joins into small PRs, and splitting out the C library change and the wrapper first may give him a few extra days to merge or align conflicting pieces.
This C API is great and is much better than what we initially came up with when we started this project! Now that it has been more thought out, I left some comments to perhaps help polish it a bit. It would help to ensure every member has a docstring now that you've nicely shown that it works!
I'm not a CUDA expert and I'll have to rely on Copilot to review that bit when it's in a dedicated PR. It seems like you have great test coverage and were kind enough to ensure the licensing and TODOs were all handled.
    void* probe_indices;
};

struct GpuSpatialIndexFloat2D {
Can we call this SedonaFloatIndex2D?
 *
 * @return 0 on success, non-zero on failure
 */
int (*init)(struct GpuSpatialIndexFloat2D* self, struct GpuSpatialIndexConfig* config);
If we move the struct GpuSpatialIndexConfig parameter to GpuSpatialIndexFloat2DCreate() (which I think is possible), this structure will no longer be GPU-specific (which will make it easier to test by implementing on the CPU or implement with Apple Metal later)
};

struct GpuSpatialJoinerContext {
struct GpuSpatialIndexContext {
It would be more future-proof to keep this as last_error and private_data. (Or maybe just private_data, with the last error accessible via a callback on the GpuSpatialIndexFloat2D). It seems like the intention is this object is opaque to the user of this API except to pass to methods that support concurrent access.
void (*get_build_indices_buffer)(struct GpuSpatialIndexContext* context,
                                 void** build_indices, uint32_t* build_indices_length);
Can the build_indices be any more strongly typed here? (Or can the docstring be more specific about what exactly it contains?)
void (*get_probe_indices_buffer)(struct GpuSpatialIndexContext* context,
                                 void** probe_indices, uint32_t* probe_indices_length);
Can the probe_indices be any more strongly typed here? (Or can the docstring be more specific about what is expected?)
struct GpuSpatialRefiner {
  int (*init)(struct GpuSpatialRefiner* self, struct GpuSpatialRefinerConfig* config);
As above, can this be a SedonaSpatialRefiner with the GpuSpatialRefinerConfig moved to GpuSpatialRefinerCreate()? We can (and should!) reuse this struct for things that are not just NVIDIA GPUs.
int (*clear)(struct GpuSpatialRefiner* self);

int (*push_build)(struct GpuSpatialRefiner* self,
                  const struct ArrowSchema* build_schema,
Is it possible to pass the build_schema and the probe_schema separately (maybe as arguments to init?) and remove them from the other methods? (Totally fine if this is not possible, just might help reduce the number of arguments to the other callbacks)
const struct ArrowArray* array2,
enum GpuSpatialRelationPredicate predicate, uint32_t* indices1,
uint32_t* indices2, uint32_t indices_size, uint32_t* new_indices_size);
void (*release)(struct GpuSpatialRefiner* self);
This last_error should probably be a callback (as above)
I see some .cc files and some .cpp files. If it's easy to rename these, I usually see projects choose one and rename files to match (except for vendored files).
// specific language governing permissions and limitations
// under the License.
#include "gpuspatial/index/detail/rt_engine.hpp"
#include "gpuspatial/rt/rt_engine.hpp"
Similarly, if .h and .hpp are both used it usually means that .h files are intended for C usage. If it's easy to rename the .h files intended for C++ usage it may look slightly cleaner.